-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Python extractor: overlay support #20206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
b18b9ce
to
3015c12
Compare
b0c7a52
to
b5c8338
Compare
f75a392
to
63106c0
Compare
63106c0
to
b3a1ba5
Compare
b3a1ba5
to
fb23977
Compare
@tausbn I'm thinking this might be a good time to checkpoint this work and get it reviewed. In the last DCA run for full analysis on this PR (see above), overall analysis time is unaffected, though there are a few outstanding stage timing results that are probably noise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds overlay support to the Python extractor by implementing infrastructure for incremental analysis through database overlays, without including overlay compilation functionality.
Key changes implemented:
- Database schema updates to support overlay metadata and change tracking
- Extractor modifications to handle overlay-specific file traversal and metadata management
- Path transformer support using updated environment variables
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
python/ql/lib/semmlecode.python.dbscheme | Adds databaseMetadata and overlayChangedFiles relations for overlay support |
python/ql/lib/semmle/python/Overlay.qll | Implements discard predicates to filter out obsolete entities during overlay analysis |
python/extractor/semmle/traverser.py | Modifies file traversal to only process changed files during overlay extraction |
python/extractor/semmle/worker.py | Adds support for writing base metadata output required for overlay operations |
python/extractor/semmle/path_rename.py | Updates path transformer to support new CODEQL_PATH_TRANSFORMER environment variable |
python/extractor/semmle/traverser.py
Outdated
with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f: | ||
data = json.load(f) | ||
changed_paths = data.get('changes', []) | ||
self.overlay_changes = { os.path.abspath(p) for p in changed_paths } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name self.overlay_changes
is inconsistent with the other instance variables which use snake_case
(self.exclude_paths
, self.recurse_files
, etc.). Consider renaming to self.overlay_changed_paths
for consistency.
Copilot generated this review using guidance from repository custom instructions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks good. 👍
Do we have any tests for this? I feel like we might want to have a few CLI Integration tests to check that the overlay JSON files are being applied correctly. (The integration tests live here: https://github.com/github/codeql/tree/main/python/extractor/cli-integration-test)
Also, don't forget to update the extractor version here: https://github.com/github/codeql/blob/main/python/extractor/semmle/util.py#L13
(In this case, I think bumping it to 7.1.4 would be fine. We don't really have fixed rules for how to increase the version. The most important thing is that it changes so that we can tell from the log output what version of the extractor we're running.)
python/extractor/semmle/traverser.py
Outdated
if 'CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES' in os.environ: | ||
with open(os.environ['CODEQL_EXTRACTOR_PYTHON_OVERLAY_CHANGES'], 'r', encoding='utf-8') as f: | ||
data = json.load(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm debating whether we should have some exception handling here (substituting the empty list of changed files in case something goes wrong). Currently, if something ends up being messed up in the JSON, then I believe the whole extraction will just fail.
I don't have strong feelings about it, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I also don't have strong opinions about whether file reading should fail loudly or warn and continue with a default (None
, i.e. full extraction). I guess I'll go for the latter. And also insert a logger statement with the value of the environment variable, as is the convention elsewhere in the extractor.
There are basic integration tests here but they depend on overlay compilation (not part of this commit), and also I'm still running into some issues on Windows (it appears that the path transformer is not working correctly there—currently debugging that). So maybe merging this should wait until I have that sorted. Otherwise, do you have an idea for an integration test for this functionality that doesn't also exercise complete overlay evaluation? |
The new name is required by overlay support.
with direct or indirect location links in dbscheme.
And don't add slash to start of path patterns on Windows.
…t-in files On Windows, we're getting e.g. the following mismatches, which could be due to case differences: "Skipped built-in file C:\hostedtoolcache\windows\Python\3.13.7\x64\Lib\multiprocessing\forkserver.py" vs "Extracted file C:\hostedtoolcache\windows\Python\3.13.7\x64\lib\asyncio\streams.py"
fb23977
to
f309dc6
Compare
I think I've figured out why path transformers weren't working on Windows and why built-in modules were being extracted (see latest commits). Now the integration test on the other PR passes. The only remaining thing now is solving some tuple count regressions uncovered through DCA, but that can be done independently of this PR. |
This PR adds overlay support to the Python extractor, but no overlay compilation (to be merged separately since it needs further testing, see this PR).
This PR also includes an initial pass at the discard predicates (see Overlay.qll), though these are ignored in full (non-overlay) evaluation; they probably still need to be tweaked, so I'm happy to move this commit to another PR and let this one be only about the extractor.
Roadmap:
CODEQL_EXTRACTOR_<LANG>_OVERLAY_BASE_METADATA_{IN,OUT}
)